My Netflix's viewing activities¶

Data Skills¶

  1. Data Transformation: change data type (using datetime), calculate new metrics, extract strings (using regex)
  2. Data Aggregation (grouping and summary)
  3. Data Visualization (using matplotlib)

Questions to Answer¶

  1. Which user watches the most on Netflix?
  2. How much time does each user spend on average?
  3. Do two users watch the same show?
  4. What are the binge watching records for a particular user?
In [ ]:
import pandas as pd
import datetime
df = pd.read_csv('netflix-report\CONTENT_INTERACTION\ViewingActivity.csv')

Data Exploration¶

In [ ]:
# Understand data structure: 6113 rows and 10 columns
df.shape
Out[ ]:
(6113, 10)
In [ ]:
# Preview data
df.head()
Out[ ]:
Profile Name Start Time Duration Attributes Title Supplemental Video Type Device Type Bookmark Latest Bookmark Country
0 AlyCha 2024-08-09 05:24:34 00:10:41 NaN The Big Bang Theory: Season 12: The Stockholm ... NaN iPhone 12 00:22:34 00:22:34 VN (Viet Nam)
1 AlyCha 2024-08-09 01:41:23 00:02:23 NaN The Big Bang Theory: Season 12: The Stockholm ... NaN iPhone 12 00:11:53 Not latest view VN (Viet Nam)
2 AlyCha 2024-08-09 01:28:07 00:09:10 NaN The Big Bang Theory: Season 12: The Stockholm ... NaN iPhone 12 00:09:29 Not latest view VN (Viet Nam)
3 AlyCha 2024-08-09 01:09:56 00:18:59 NaN The Big Bang Theory: Season 12: The Change Con... NaN iPhone 12 00:19:19 00:19:19 VN (Viet Nam)
4 AlyCha 2024-08-09 00:23:03 00:19:36 NaN The Big Bang Theory: Season 12: The Maternal C... NaN iPhone 12 00:19:56 00:19:56 VN (Viet Nam)
In [ ]:
# See unique profile names
df["Profile Name"].unique()
Out[ ]:
array(['AlyCha', 'Chi', 'Mom'], dtype=object)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6113 entries, 0 to 6112
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Profile Name             6113 non-null   object
 1   Start Time               6113 non-null   object
 2   Duration                 6113 non-null   object
 3   Attributes               1836 non-null   object
 4   Title                    6113 non-null   object
 5   Supplemental Video Type  1124 non-null   object
 6   Device Type              6113 non-null   object
 7   Bookmark                 6113 non-null   object
 8   Latest Bookmark          6113 non-null   object
 9   Country                  6113 non-null   object
dtypes: object(10)
memory usage: 477.7+ KB

Data Transformation¶

In [ ]:
# Transform Start Time to datetime format and verify
df['Start Time'] = pd.to_datetime(df['Start Time'], utc=True)
df.dtypes
Out[ ]:
Profile Name                            object
Start Time                 datetime64[ns, UTC]
Duration                                object
Attributes                              object
Title                                   object
Supplemental Video Type                 object
Device Type                             object
Bookmark                                object
Latest Bookmark                         object
Country                                 object
dtype: object
In [ ]:
# Transform Duration to seconds
df['Hours'] = df['Duration'].str.slice(0,2).astype('int64')
df['Minutes'] = df['Duration'].str.slice(3,5).astype('int64')
df['Seconds'] = df['Duration'].str.slice(6,8).astype('int64')
df['Duration'] = df['Hours'] * 3600 + df['Minutes'] * 60 + df['Seconds']
In [ ]:
# Remove helper columns
df = df.drop(columns=['Hours','Minutes', 'Seconds'], axis=1)
df.head()
Out[ ]:
Profile Name Start Time Duration Attributes Title Supplemental Video Type Device Type Bookmark Latest Bookmark Country
0 AlyCha 2024-08-09 05:24:34+00:00 641 NaN The Big Bang Theory: Season 12: The Stockholm ... NaN iPhone 12 00:22:34 00:22:34 VN (Viet Nam)
1 AlyCha 2024-08-09 01:41:23+00:00 143 NaN The Big Bang Theory: Season 12: The Stockholm ... NaN iPhone 12 00:11:53 Not latest view VN (Viet Nam)
2 AlyCha 2024-08-09 01:28:07+00:00 550 NaN The Big Bang Theory: Season 12: The Stockholm ... NaN iPhone 12 00:09:29 Not latest view VN (Viet Nam)
3 AlyCha 2024-08-09 01:09:56+00:00 1139 NaN The Big Bang Theory: Season 12: The Change Con... NaN iPhone 12 00:19:19 00:19:19 VN (Viet Nam)
4 AlyCha 2024-08-09 00:23:03+00:00 1176 NaN The Big Bang Theory: Season 12: The Maternal C... NaN iPhone 12 00:19:56 00:19:56 VN (Viet Nam)

Data Analysis¶

Q1: Which profile has the most viewing activities?

In [ ]:
df['Profile Name'].value_counts()
# Alyssa is the winner
Out[ ]:
Profile Name
AlyCha    2584
Mom       1940
Chi       1589
Name: count, dtype: int64

Q2: Which profile watches the most time?

In [ ]:
# To put into context, the data was collected from Oct 2020 through Aug 2024 (~ 4 years)
df['Start Time'].agg(['min', 'max'])
Out[ ]:
min   2020-10-17 14:11:50+00:00
max   2024-08-15 11:30:14+00:00
Name: Start Time, dtype: datetime64[ns, UTC]
In [ ]:
viewTime = df.groupby('Profile Name')['Duration'].sum()
viewTime = viewTime.to_frame().reset_index()
viewTime['Duration'] = viewTime['Duration']/3600
viewTime['Duration'] = viewTime['Duration'].round().astype('int64')
viewTime.sort_values(by='Duration', ascending=False)
# Mom watches the most (contrary to what she claimed :))
Out[ ]:
Profile Name Duration
2 Mom 842
0 AlyCha 564
1 Chi 392
In [ ]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# Visualize data using bar chart
viewTime.sort_values(by='Duration', ascending=False).plot(kind="bar", x="Profile Name", xlabel='', y='Duration', ylabel='Hours Watched', title="Watch time per Profile", legend=False, color='red', figsize=(4,3))
plt.xticks(rotation=360, horizontalalignment="center")
plt.gca().set_facecolor('black')
No description has been provided for this image

Q3: Find mutual titles that Chi and Alyssa watched

In [ ]:
# Compile a list of unique titles that Alyssa watched
Aly_shows = df[(df['Profile Name'] == 'AlyCha') & (df['Supplemental Video Type'].isnull())]['Title'].unique()
In [ ]:
# Compile a list of unique titles that Chi watched
Chi_shows = df[(df['Profile Name'] == 'Chi') & (df['Supplemental Video Type'].isnull())]['Title'].unique()
In [ ]:
# Loop through two lists to find overlapping titles
mutual_shows = []
for i in Chi_shows:
    for j in Aly_shows:
        if i == j:
            mutual_shows.append(i)

print(f'Alyssa and Chi have watched {len(mutual_shows)} titles in common.')
print('Some of them are:')
import random
for i in random.sample(mutual_shows, 5):
    print(f'* {i}')
Alyssa and Chi have watched 236 titles in common.
Some of them are:
* The Big Bang Theory: Season 6: The 43 Peculiarity (Episode 8)
* The Big Bang Theory: Season 7: The Proton Transmogrification (Episode 22)
* SPY x FAMILY: Season 1: Operation Strix (Episode 1)
* The Big Bang Theory: Season 7: The Locomotive Manipulation (Episode 15)
* The Big Bang Theory: Season 10: The Romance Recalibration (Episode 13)

Q4: Show average watch time per user excluding trailers and hooks

In [ ]:
# Extract only true watching activities (excluding trailers and hooks)
watch = df[df['Supplemental Video Type'].isnull()]
watch = watch.drop('Supplemental Video Type', axis=1)
watch.head()
Out[ ]:
Profile Name Start Time Duration Attributes Title Device Type Bookmark Latest Bookmark Country
0 AlyCha 2024-08-09 05:24:34+00:00 641 NaN The Big Bang Theory: Season 12: The Stockholm ... iPhone 12 00:22:34 00:22:34 VN (Viet Nam)
1 AlyCha 2024-08-09 01:41:23+00:00 143 NaN The Big Bang Theory: Season 12: The Stockholm ... iPhone 12 00:11:53 Not latest view VN (Viet Nam)
2 AlyCha 2024-08-09 01:28:07+00:00 550 NaN The Big Bang Theory: Season 12: The Stockholm ... iPhone 12 00:09:29 Not latest view VN (Viet Nam)
3 AlyCha 2024-08-09 01:09:56+00:00 1139 NaN The Big Bang Theory: Season 12: The Change Con... iPhone 12 00:19:19 00:19:19 VN (Viet Nam)
4 AlyCha 2024-08-09 00:23:03+00:00 1176 NaN The Big Bang Theory: Season 12: The Maternal C... iPhone 12 00:19:56 00:19:56 VN (Viet Nam)
In [ ]:
# Group watchtime by Profile and show average watchtime in minutes
summary = watch.groupby('Profile Name')['Duration'].mean().round()
summary = summary.to_frame().reset_index()
summary['Avg Duration (min)'] = summary['Duration']/60
summary['Avg Duration (min)'] = summary['Avg Duration (min)'].round().astype('int')
summary = summary[['Profile Name', 'Avg Duration (min)']].sort_values(by='Avg Duration (min)', ascending=False)
summary 
Out[ ]:
Profile Name Avg Duration (min)
2 Mom 29
1 Chi 19
0 AlyCha 17
In [ ]:
# Visualize results in bar chart
summary.plot(kind='bar', x='Profile Name', y = 'Avg Duration (min)', xlabel='', ylabel='Minutes', title=" Average Watch time per Profile", legend=False, color='red', figsize=(4,3))
plt.xticks(rotation=360, horizontalalignment="center")
plt.gca().set_facecolor('black')
No description has been provided for this image

Q5: Binge records for Chi

Define binge watching: Watch 3+ episodes of the same series in the same day

In [ ]:
# Create a new dataframe for Chi's viewing activities
chi = watch[watch['Profile Name'] == 'Chi']
chi.head()
Out[ ]:
Profile Name Start Time Duration Attributes Title Device Type Bookmark Latest Bookmark Country
2584 Chi 2024-08-15 11:30:14+00:00 2765 NaN Downton Abbey: Series 2: Episode 5 DefaultWidevineAndroidPhone 00:54:45 00:54:45 FR (France)
2585 Chi 2024-08-15 10:18:36+00:00 3078 NaN Downton Abbey: Series 2: Episode 4 Chrome PC (Cadmium) 00:53:40 00:53:40 FR (France)
2591 Chi 2024-08-14 17:17:43+00:00 1273 NaN The Umbrella Academy: Season 4: End of the Beg... Chrome PC (Cadmium) 01:07:17 01:07:17 FR (France)
2593 Chi 2024-08-14 11:32:33+00:00 2521 NaN The Umbrella Academy: Season 4: End of the Beg... DefaultWidevineAndroidPhone 00:43:48 Not latest view FR (France)
2594 Chi 2024-08-14 11:02:00+00:00 1826 NaN The Umbrella Academy: Season 4: Six Years, Fiv... DefaultWidevineAndroidPhone 00:54:52 00:54:52 FR (France)
In [ ]:
# Extract only show title
chi.loc[:,'Show'] = chi['Title'].str.extract(r'(.+?(?=:))')
chi.head()
Out[ ]:
Profile Name Start Time Duration Attributes Title Device Type Bookmark Latest Bookmark Country Show
2584 Chi 2024-08-15 11:30:14+00:00 2765 NaN Downton Abbey: Series 2: Episode 5 DefaultWidevineAndroidPhone 00:54:45 00:54:45 FR (France) Downton Abbey
2585 Chi 2024-08-15 10:18:36+00:00 3078 NaN Downton Abbey: Series 2: Episode 4 Chrome PC (Cadmium) 00:53:40 00:53:40 FR (France) Downton Abbey
2591 Chi 2024-08-14 17:17:43+00:00 1273 NaN The Umbrella Academy: Season 4: End of the Beg... Chrome PC (Cadmium) 01:07:17 01:07:17 FR (France) The Umbrella Academy
2593 Chi 2024-08-14 11:32:33+00:00 2521 NaN The Umbrella Academy: Season 4: End of the Beg... DefaultWidevineAndroidPhone 00:43:48 Not latest view FR (France) The Umbrella Academy
2594 Chi 2024-08-14 11:02:00+00:00 1826 NaN The Umbrella Academy: Season 4: Six Years, Fiv... DefaultWidevineAndroidPhone 00:54:52 00:54:52 FR (France) The Umbrella Academy
In [ ]:
# Extract date from watch time
from datetime import datetime
chi.loc[:,'Date'] = chi['Start Time'].dt.date
chi.head()
Out[ ]:
Profile Name Start Time Duration Attributes Title Device Type Bookmark Latest Bookmark Country Show Date
2584 Chi 2024-08-15 11:30:14+00:00 2765 NaN Downton Abbey: Series 2: Episode 5 DefaultWidevineAndroidPhone 00:54:45 00:54:45 FR (France) Downton Abbey 2024-08-15
2585 Chi 2024-08-15 10:18:36+00:00 3078 NaN Downton Abbey: Series 2: Episode 4 Chrome PC (Cadmium) 00:53:40 00:53:40 FR (France) Downton Abbey 2024-08-15
2591 Chi 2024-08-14 17:17:43+00:00 1273 NaN The Umbrella Academy: Season 4: End of the Beg... Chrome PC (Cadmium) 01:07:17 01:07:17 FR (France) The Umbrella Academy 2024-08-14
2593 Chi 2024-08-14 11:32:33+00:00 2521 NaN The Umbrella Academy: Season 4: End of the Beg... DefaultWidevineAndroidPhone 00:43:48 Not latest view FR (France) The Umbrella Academy 2024-08-14
2594 Chi 2024-08-14 11:02:00+00:00 1826 NaN The Umbrella Academy: Season 4: Six Years, Fiv... DefaultWidevineAndroidPhone 00:54:52 00:54:52 FR (France) The Umbrella Academy 2024-08-14
In [ ]:
chi[chi['Show']=='The Good Doctor'].sort_values(by='Date').head()
# After examination, I realized that each row records a time I opened the player, so it counts in the times when I had to reopen the video or switch to another device
# Solution: filter only sessions with watch time > 15 minutes (900s)
Out[ ]:
Profile Name Start Time Duration Attributes Title Device Type Bookmark Latest Bookmark Country Show Date
4172 Chi 2021-07-25 15:07:29+00:00 116 NaN The Good Doctor: Season 1: Burnt Food (Episode 1) Chrome PC (Cadmium) 00:01:56 Not latest view VN (Viet Nam) The Good Doctor 2021-07-25
4171 Chi 2021-07-25 15:18:20+00:00 151 NaN The Good Doctor: Season 1: Burnt Food (Episode 1) DefaultWidevineAndroidTablets 00:04:36 Not latest view VN (Viet Nam) The Good Doctor 2021-07-25
4163 Chi 2021-08-16 16:24:12+00:00 91 NaN The Good Doctor: Season 1: Burnt Food (Episode 1) Google Smart Display Nest Hub 2018 00:13:48 Not latest view VN (Viet Nam) The Good Doctor 2021-08-16
4162 Chi 2021-08-16 16:26:20+00:00 4 NaN The Good Doctor: Season 1: Burnt Food (Episode 1) Google Smart Display Nest Hub 2018 00:13:42 Not latest view VN (Viet Nam) The Good Doctor 2021-08-16
4161 Chi 2021-08-16 16:27:12+00:00 8 NaN The Good Doctor: Season 1: Burnt Food (Episode 1) DefaultWidevineAndroidTablets 00:13:55 00:13:55 VN (Viet Nam) The Good Doctor 2021-08-16
In [ ]:
# Extract an aggregated summary of watches per show
binge = chi[['Date','Show','Duration']][chi['Duration'] > 900]
binge = binge.groupby(['Date','Show']).agg(Count = ('Show','count'), TotalTime = ('Duration','sum')).reset_index()
binge.head()
Out[ ]:
Date Show Count TotalTime
0 2021-08-16 The Good Doctor 1 2531
1 2021-08-18 The Good Doctor 2 5019
2 2021-08-19 The Good Doctor 1 2558
3 2021-08-20 The Good Doctor 2 4940
4 2021-08-21 The Good Doctor 3 5094
In [ ]:
#Top 10 binge by number of episodes watched in a day
Top10byEp = binge.sort_values(by='Count', ascending=False).head(10).reset_index(drop=True)
Top10byEp.rename(columns={'Count':'No.Episodes'}, inplace=True)
Top10byEp = Top10byEp.drop('TotalTime', axis=1)
Top10byEp
Out[ ]:
Date Show No.Episodes
0 2021-09-10 The Big Bang Theory 6
1 2023-02-21 The Hook Up Plan 5
2 2021-10-08 The Big Bang Theory 5
3 2021-09-17 The Big Bang Theory 5
4 2021-09-27 The Big Bang Theory 5
5 2022-01-10 The Big Bang Theory 5
6 2022-04-24 Heartstopper 5
7 2021-10-02 The Big Bang Theory 5
8 2021-08-29 Lupin 5
9 2021-10-10 The Big Bang Theory 5
In [ ]:
#Top 10 binge by total time watched in a day (in hours)
Top10byTime = binge.sort_values(by='TotalTime', ascending=False).head(10).reset_index(drop=True)
Top10byTime['TotalTime'] = (Top10byTime['TotalTime']/3600).round(1).astype('float')
Top10byTime
Out[ ]:
Date Show Count TotalTime
0 2021-08-29 Lupin 5 3.0
1 2023-10-06 Sex Education 4 3.0
2 2022-06-13 Venom 3 2.6
3 2021-09-02 Bridgerton 3 2.6
4 2023-10-09 Sex Education 3 2.4
5 2024-06-17 Bridgerton 3 2.3
6 2021-09-05 Bridgerton 4 2.3
7 2024-08-14 The Umbrella Academy 5 2.3
8 2022-07-03 Stranger Things 2 2.3
9 2021-08-28 Lupin 3 2.2